refactor(BA-5857): Pin container memory to NUMA-local node via CpusetMems#11222
refactor(BA-5857): Pin container memory to NUMA-local node via CpusetMems#11222rapsealk wants to merge 10 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Adds NUMA-aware memory pinning for CPU-only Docker containers by setting HostConfig.CpusetMems when the allocated CPU cores are all on the same NUMA node, and omitting it otherwise.
Changes:
- Build
HostConfiginCPUPlugin.generate_docker_args()and conditionally setCpusetMemswhen allocation is NUMA-local. - Add a changelog entry for the enhancement.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/ai/backend/agent/docker/intrinsic.py |
Detects allocated cores’ NUMA node(s) and sets HostConfig.CpusetMems only for single-node allocations. |
changes/11217.enhance.md |
Documents the NUMA-local CpusetMems behavior change. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| devices = await self.list_devices() | ||
| core_to_node = {int(dev.device_id): dev.numa_node for dev in devices} | ||
| allocated_nodes: set[int] = set() | ||
| for core in cores: | ||
| node = core_to_node.get(core) |
There was a problem hiding this comment.
generate_docker_args() now calls await self.list_devices() just to map allocated core IDs to NUMA nodes. list_devices() iterates all available cores and calls libnuma.node_of_cpu() for each, so this adds per-container startup overhead that scales with host CPU count. Consider computing the NUMA node only for the allocated cores (e.g., call libnuma.node_of_cpu(core) inside the loop), or cache a core_id -> numa_node mapping on the plugin instance and reuse it.
| devices = await self.list_devices() | |
| core_to_node = {int(dev.device_id): dev.numa_node for dev in devices} | |
| allocated_nodes: set[int] = set() | |
| for core in cores: | |
| node = core_to_node.get(core) | |
| allocated_nodes: set[int] = set() | |
| for core in cores: | |
| node = libnuma.node_of_cpu(core) |
| # Pin memory to the NUMA node only when the allocation is fully node-local; | ||
| # Docker does not expose a multi-node cpuset.mems equivalent via HostConfig. |
There was a problem hiding this comment.
The comment says Docker does not expose a multi-node cpuset.mems equivalent via HostConfig, but HostConfig.CpusetMems accepts the same cpuset list/range syntax as cpuset.mems (e.g., "0,1"). If the intent is to only pin when node-local, reword this to reflect that it’s a deliberate policy choice (and why), not a Docker API limitation.
| # Pin memory to the NUMA node only when the allocation is fully node-local; | |
| # Docker does not expose a multi-node cpuset.mems equivalent via HostConfig. | |
| # Pin memory only when the CPU allocation is fully node-local. | |
| # For multi-node CPU allocations, intentionally leave CpusetMems unset | |
| # so Docker/kernel default NUMA memory placement policy can apply. |
Adds unit coverage for generate_docker_args() setting HostConfig.CpusetMems only when the CPU allocation is fully within a single NUMA node, and omitting the key for multi-node or unknown/negative NUMA mappings. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Query libnuma.node_of_cpu() only for allocated cores instead of calling self.list_devices(), which iterates every core on the host on each container start. Drops per-start overhead from O(host_cores) to O(allocated_cores). - Reword the CpusetMems comment to reflect that omission for multi-node allocations is a deliberate policy choice (defer to kernel default NUMA placement) rather than a Docker HostConfig API limitation. - Update the NUMA-locality unit tests to patch libnuma.node_of_cpu directly, matching the new implementation seam. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mbstone Addresses review feedback on PR #11222: - Skip CpusetMems entirely when libnuma reports NUMA unsupported or the host has a single node, so non-NUMA Linux hosts and macOS/WSL dev environments do not get CpusetMems="0" pinned on every container. - Add a unit test covering the non-NUMA short-circuit branch. - Remove the twin commented-out CpusetMems tombstone from src/ai/backend/agent/dummy/intrinsic.py. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Review summary from an independent pass + resolutions pushed to this branch. Verdict: ship with changes — addressed.
Scope stayed tight: only |
- Remove unused local_config + _docker mock from the NUMA-locality cpu_plugin fixture: CPUPlugin.generate_docker_args doesn't read them. - Replace the case_id parametrize column with pytest's `ids=` kwarg; pytest already labels failures via the parametrize id. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| # Skip CpusetMems entirely when NUMA is unsupported (non-Linux hosts, | ||
| # Linux without libnuma.so, Docker Desktop, WSL, etc.) or when the host | ||
| # exposes a single node; libnuma.node_of_cpu would otherwise fall back | ||
| # to 0 and cause every container to be pinned to "CpusetMems": "0". | ||
| if libnuma.num_nodes() > 1: | ||
| allocated_nodes: set[int] = set() | ||
| for core in cores: | ||
| node = libnuma.node_of_cpu(core) | ||
| if node < 0: | ||
| allocated_nodes.clear() | ||
| break | ||
| allocated_nodes.add(node) | ||
| # Pin memory only when the CPU allocation is fully node-local. | ||
| # For multi-node CPU allocations, intentionally leave CpusetMems unset | ||
| # so Docker/kernel default NUMA memory placement policy can apply. | ||
| if len(allocated_nodes) == 1: | ||
| host_config["CpusetMems"] = str(next(iter(allocated_nodes))) |
There was a problem hiding this comment.
How about extracting this logic into a separate private method?
|
|
||
| @staticmethod | ||
| @contextmanager | ||
| def _patch_node_of_cpu( |
There was a problem hiding this comment.
Since this method is intended to reproduce specific test scenarios, it would be more readable to inject fixtures for each scenario and apply parametrization.
…ests Address review comments from @jopemachine on PR #11222: - Factor the NUMA-locality branch out of `generate_docker_args` into `CPUPlugin._resolve_node_local_mem`, so the main method stays flat and the helper can be unit-tested without plumbing a full `device_alloc`. - Collapse the four scenario-specific tests into a single `_NumaScenario`-keyed fixture + parametrized tests — one directly against `_resolve_node_local_mem`, one end-to-end through `generate_docker_args`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Set `CpusetMems` in Docker HostConfig when all allocated CPU cores reside on a single NUMA node, so container memory is pinned to the same node as its CPUs. The key is omitted when the allocation spans multiple nodes or when NUMA info is unavailable on the host. Closes #11217 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds unit coverage for generate_docker_args() setting HostConfig.CpusetMems only when the CPU allocation is fully within a single NUMA node, and omitting the key for multi-node or unknown/negative NUMA mappings. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Query libnuma.node_of_cpu() only for allocated cores instead of calling self.list_devices(), which iterates every core on the host on each container start. Drops per-start overhead from O(host_cores) to O(allocated_cores). - Reword the CpusetMems comment to reflect that omission for multi-node allocations is a deliberate policy choice (defer to kernel default NUMA placement) rather than a Docker HostConfig API limitation. - Update the NUMA-locality unit tests to patch libnuma.node_of_cpu directly, matching the new implementation seam. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mbstone Addresses review feedback on PR #11222: - Skip CpusetMems entirely when libnuma reports NUMA unsupported or the host has a single node, so non-NUMA Linux hosts and macOS/WSL dev environments do not get CpusetMems="0" pinned on every container. - Add a unit test covering the non-NUMA short-circuit branch. - Remove the twin commented-out CpusetMems tombstone from src/ai/backend/agent/dummy/intrinsic.py. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove unused local_config + _docker mock from the NUMA-locality cpu_plugin fixture: CPUPlugin.generate_docker_args doesn't read them. - Replace the case_id parametrize column with pytest's `ids=` kwarg; pytest already labels failures via the parametrize id. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ests Address review comments from @jopemachine on PR #11222: - Factor the NUMA-locality branch out of `generate_docker_args` into `CPUPlugin._resolve_node_local_mem`, so the main method stays flat and the helper can be unit-tested without plumbing a full `device_alloc`. - Collapse the four scenario-specific tests into a single `_NumaScenario`-keyed fixture + parametrized tests — one directly against `_resolve_node_local_mem`, one end-to-end through `generate_docker_args`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-review follow-ups on PR #11222: - Add an `empty_allocation` scenario to `_NUMA_SCENARIOS` so the `cores=[]` path through `_resolve_node_local_mem` / `generate_docker_args` is locked in (must not emit `CpusetMems`). - Document in `kubernetes/intrinsic.py::CPUPlugin.generate_docker_args` why the NUMA pinning from the Docker backend is not mirrored (k8s drives node-local memory via the kubelet Topology Manager). - Document in `dummy/intrinsic.py::CPUPlugin.generate_docker_args` why dummy intentionally skips the NUMA pinning (never runs real containers). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c5c3971 to
5732246
Compare
…quests The previous comment implied Kubernetes was benefiting from the kubelet Topology Manager for NUMA memory placement, but this backend does not emit per-container CPU/memory requests/limits today — so Topology Manager has no hook to act on (Guaranteed-QoS pods are required). Rewrite the comment to state that accurately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The optimization only fires when the CPU allocation fits on a single NUMA node. Under the default `INTERLEAVED` affinity policy, multi-core allocations spread across nodes and the change is a no-op. Make this explicit so operators know to pair the upgrade with `PREFER_SINGLE_NODE` if they want the latency win for multi-core workloads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #11217 (BA-5857)
Refs #11216
Summary
HostConfig.CpusetMemsto the NUMA node ID when all allocated CPU cores are on the same node.CpusetMemsfor multi-node allocations so the kernel's default NUMA memory placement policy applies.libnuma.num_nodes() <= 1so non-NUMA Linux hosts and macOS / WSL / Docker Desktop don't getCpusetMems="0"pinned on every container.libnuma.node_of_cpu()instead of a host-widelist_devices()scan.Why
Pre-PR the NUMA memory pinning code was commented out, so CPU-only workloads on NUMA hosts could pay 2-3× remote-memory access latency. Part of the agent-runtime performance epic #11216.
Test plan
pants check src/ai/backend/agent/docker::andpants test tests/unit/agent::passHostConfig.CpusetMemsreflects the allocation nodeCpusetMemsis absent